Fix autotune thread safety (crash) under GIL=0 #8437
base: main
Conversation
I probably didn't get the use of thread-local here. Doesn't it mean that we need to autotune T times if there are T concurrent threads?
Also, if two threads happen to use the same GPU, the measured perf won't match the original case where a single thread uses the GPU, since kernels from each thread will be launched and executed in an interleaved way.
Having a global cache or lock doesn't help this situation. Say you have two threads bound to gpu:0: one thread is already running pre-tuned kernel work while the second thread, also bound to gpu:0, is entering autotune for the first time on a different kernel. I don't think we can control this. It is up to the end user to launch new kernels serially and hold exclusive locks per gpu:index to make sure the autotune benchmarks are consistent and accurate.
Yes. This is the downside of thread-local. I think in most cases end users will launch a thread and have it persist for the lifetime of the model execution instead of launching random threads each time they need to execute a kernel. The cost of launching threads and context switching is quite high. Maybe my view of how threads should be used differs from reality, but in my view users should not be tearing down and spinning up GPU-bound execution threads; just as you don't repeatedly launch and tear down GPU processes, threads should persist for the duration of the program.
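For illustration, a minimal sketch of the persistent, device-bound worker pattern described above, assuming PyTorch's `torch.cuda.set_device` for device binding; this is illustrative only and not part of the PR:

```python
import queue
import threading

import torch


def gpu_worker(device_index: int, jobs: queue.Queue):
    # Persistent thread pinned to one GPU for the life of the program; all
    # kernels (and any autotuning) for this device run on this one thread.
    torch.cuda.set_device(device_index)
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut the worker down
            break
        job()            # e.g. launch a Triton kernel on this device


# One persistent worker per GPU, each with its own job queue.
queues = [queue.Queue() for _ in range(torch.cuda.device_count())]
workers = [
    threading.Thread(target=gpu_worker, args=(i, q), daemon=True)
    for i, q in enumerate(queues)
]
for w in workers:
    w.start()
```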
@Jokeren I think global cache and thread-local cache both have advantages and disadvantages. In my world (
I believe that a thread-local implementation may introduce additional inconsistencies. |
Force-pushed from f5d62c3 to 2debf8d
Force-pushed from cb06283 to 10a5601
@Jokeren Please re-review. Here are the changes since your last review:
@codex review
Let's keep the PR open until we have figured out more complete solutions for GIL=0.
Yes. This PR only fixes the crash.
Not yet. So I think we don't plan to merge this PR in the short term. cc @ThomasRaoux
Ok. I am hoping this high-impact, low-risk, low-hanging-fruit PR will get consideration before a full-on thread-safe Triton roadmap. I am also willing to spend more time on this PR to address any lingering concerns of accuracy, safety, or execution-pattern deviation from
On first launch, a Triton kernel will call `_bench` and related code to cache the best config for subsequent kernel launches. This part of the code is racy and will crash under GIL=0 (Python >= 3.13t). The crash manifests as cryptic `NoneObject` has-no-attribute errors to the end user, when in reality it is a data race. I encountered this doing GIL=0 thread-based module-parallel execution on multiple GPUs in GPT-QModel, a quantization toolkit that ships Triton kernels.

EDIT: Please ignore the following block. The PR has been updated to use a global lock + per-key future for multi-threading autotuning sync.
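For reference, a minimal sketch of the global-lock + per-key-future pattern the EDIT describes; all names here (`_lock`, `_futures`, `get_best_config`) are hypothetical, and the PR's actual implementation may differ:

```python
import threading
from concurrent.futures import Future

# Illustrative module-level state; identifiers are hypothetical.
_lock = threading.Lock()  # guards the dict only; never held while benchmarking
_futures = {}             # autotune cache key -> Future resolving to the best config


def get_best_config(key, benchmark):
    """First caller per key runs `benchmark`; concurrent callers wait on its Future."""
    with _lock:
        fut = _futures.get(key)
        owner = fut is None
        if owner:
            fut = _futures[key] = Future()
    if owner:
        try:
            fut.set_result(benchmark())  # expensive autotuning runs outside the lock
        except BaseException as exc:
            fut.set_exception(exc)
            raise
    return fut.result()  # other threads block here until the winner publishes
```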
------ Outdated ------
There were two methods of protection, `locks` or `thread-local`. I implemented `thread-local` for the following reasons:

- `thread-local` (no locks) will be much faster for cache retrieval vs a lock-protected global cache.
- `thread-local` has an obvious downside in persistence and duplication of Triton configs + extra autotunes per GPU thread. I consider this a small downside with upsides.

Upsides:

- If you use Python GIL=0 and care about performance, you will likely use thread pools rather than launching a thread every single time. This nullifies the persistence downside in real-world applications. Launching a new thread for every kernel execution is bad practice (thread startup is slower than any code/lock overhead by a large factor), and we should not spend too much effort optimizing for bad code practices.
- If you want to further optimize your threading code, you will likely use persistent threads and bind each thread to a specific cuda:index. With a global cache, Triton would assume all cuda:index devices are the same GPU. That is not the case for my setup, and we shouldn't make this assumption for others. Frankly, the benchmark/config cache should be keyed per cuda:index (unique GPU fingerprint), but that is a topic for another PR. With `thread-local` we actually achieve an optimized kernel launch per cuda:index (potentially a unique GPU), since the developer should already persist a thread in a pool and bind it to a cuda:index.

As such, I believe `thread-local` is a better option than a global cache (with locks): more scalable, lower latency, and more accurate (indirectly). I am unsure how much memory the duplicated configs would cost, but I think it's minimal.
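For comparison, a minimal sketch of the thread-local variant described above; the names (`_tls`, `get_best_config`) are hypothetical, not the PR's actual code:

```python
import threading

# Illustrative sketch: each thread owns a private config cache, so reads need
# no lock, at the cost of one autotune per thread (and per device it binds).
_tls = threading.local()


def get_best_config(key, benchmark):
    cache = getattr(_tls, "cache", None)
    if cache is None:
        cache = _tls.cache = {}
    if key not in cache:
        cache[key] = benchmark()  # each thread benchmarks independently
    return cache[key]
```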
An extra unit test, `test_autotuner_thread_safety`, has been added to the existing `python/test/unit/runtime/test_autotuner.py` test file. The unit test should run in a GIL=0 env; Triton CI should set up a Python 3.14t runner for it.
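The test body itself is not shown in this thread; below is a hedged sketch of the general shape such a test could take, where `launch_kernel` stands in for the first launch of a real autotuned Triton kernel:

```python
import threading


def run_threads_against(launch_kernel, n_threads=8):
    """Drive `launch_kernel` (a stand-in for the first launch of an autotuned
    Triton kernel) from many threads at once and collect any exceptions."""
    barrier = threading.Barrier(n_threads)
    errors = []

    def worker():
        try:
            barrier.wait()  # line all threads up on the cold autotune cache
            launch_kernel()
        except Exception as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


def test_autotuner_thread_safety_sketch():
    # With a real autotuned kernel plugged in, any data race in the autotune
    # cache shows up here as collected exceptions (or a hard crash) under GIL=0.
    errors = run_threads_against(lambda: None)  # substitute the real launch
    assert not errors
```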
New contributor declaration

I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these rules.
I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

Select one of the following.
`test_autotuner_thread_safety` inside the existing `python/test/unit/runtime/test_autotuner.py`.

Select one of the following.
`lit` tests.